In this part we will study different aircrafts types Using PCA
the first variable was considered as a qualitative variable because it presents the different modalities of the international fleet
library(factoextra)
## Warning: package 'factoextra' was built under R version 3.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.3
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(FactoMineR)
data <-read.table(file = file.choose(), sep=";", header=TRUE , dec=".")
dataFLotte <- read.table(file = file.choose(), sep=";", header=TRUE , dec=".")
View(dataFLotte)
result_pca <- PCA(data, quali.sup=1)
eig.val <- get_eigenvalue(result_pca)
fviz_eig(result_pca, addlabels = TRUE, ylim = c(0, 50))
The correlation circle summarizes 65.2% of the total information only in the first and second axis if we include the 3rd axis we will have 80% of the total information
result_pca <- PCA(data, quali.sup=1)
result_pca <- PCA(data, quali.sup=1,axes = 2:3)
result_pca <- PCA(data, quali.sup=1,axes =c(1,3))
library("corrplot")
## corrplot 0.84 loaded
var <- get_pca_var(result_pca)
corrplot(var$cos2, is.corr=FALSE)
fviz_cos2(result_pca, choice = "var", axes = 1:2)
fviz_cos2(result_pca, choice = "var", axes = 2:3)
fviz_cos2(result_pca, choice = "var", axes = c(3,1))
The plots showed us that there are 3 clouds of individuals Unfortunately the interpretation seems a little difficult given the large number of variables to be studied. For this reason, we will eliminate variables that are not relevant and have a poor representation. We will consider the individual VIP as additional (since it presents a unique case) as well as the varibales 2,3,4 (relating to this individual VIP) and 24,25,26,27,28,29,30,31 , 32,33,34 which are binary variables giving information on the types of classes in each airplane, this is redundancy because the names of the variables give information on the existing classes. EXample: seat.width..FirstClass
result=PCA(data,ind.sup =2,quanti.sup=c(2,3,4,24,25,26,27,28,29,30,31,32,33,34),quali.sup =1)
## Warning in arrows(0, 0, coord.quanti[q, 1], coord.quanti[q, 2], length =
## 0.1, : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, coord.quanti[q, 1], coord.quanti[q, 2], length =
## 0.1, : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, coord.quanti[q, 1], coord.quanti[q, 2], length =
## 0.1, : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, coord.quanti[q, 1], coord.quanti[q, 2], length =
## 0.1, : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, coord.quanti[q, 1], coord.quanti[q, 2], length =
## 0.1, : zero-length arrow is of indeterminate angle and so skipped
fviz_pca_var(result, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Ãvite le chevauchement de texte,
)
From a business point of view, we decided to eliminate some variables such as WingSpan, tail.Height, mileRange, Engines to have accurate results, that’s why we’ll eliminate them While flying over the data, we noticed that there are two aspects to study: all that is related to the places and their classes (Business, EcoComfort ..;) and the other options.
Now , we are going to focus on the variables related to seats characteristics and try to reduce dimensionality
dataSeats=data.frame(dataFLotte$Aircraft,dataFLotte$Seats..First.Class.,dataFLotte$Seats..Business.,dataFLotte$Seats..Eco.Comfort.,dataFLotte$Seats..Economy.,dataFLotte$Accommodation)
result1=PCA(dataSeats,ind.sup =2,quali.sup =1)
fviz_pca_var(result1, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Ãvite le chevauchement de texte,
)
var <- get_pca_var(result1)
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)
fviz_cos2(result1, choice = "var", axes = 1:2)
The accomodation variable informs us of the number of seats in each plane and the other variables (Seats first class, seats business, seats Economy and Seats economy comfort) provide information on the number of seats per class in the same plane. the circle in this case sums up more than 95% of the total information with a good quality of representation.
we note that:
-Accomodation, seats Eco comfort and Seats Economy are the most corrolated positively with the first axis
-Seats First class is corrolated positively with the second axis
on the other hand Seats business is negatively correlated with the second axis and positively with the first axis
and if we project the individuals, we obtain the following information:
-a first cloud positively corrolated with the first axis representing aircraft that specialize mainly in the classes Eco comfort, Economy and less importantly in Business class
EXAMPLE: individual 14 having 376 places distributed as follows:
Economy :286 places Eco Comfort :42 places Business :48 places
-A second cloud of points correlated positively with the axis 2 which represents the planes which have more places in First class taking the example of the individual 28 having the highest number of first class places: 36 places
_a third point cloud correlated negatively with the two axes having a small number of places compared to the other planes Example the individual 38 having a total of 28 places.
fviz_pca_ind(result1,
geom.ind = "point", # Montre les points seulement (mais pas le "text")
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE, # Ellipses de concentration
legend.title = "Groups"
)
fviz_pca_biplot(result1, axes = c(1, 2),
label = "all", invisible = "none", labelsize = 4, pointsize = 2)
View(dataSeats)
boxplot(dataSeats[,2:5],names = c("First Class","business","Eco Comfort","Economy"))
#ScaledSeats=scale(dataSeats[,2:5], center = TRUE, scale = TRUE)
#boxplot(ScaledSeats[,],names = c("First Class","business","Eco #Comfort","Economy"))
#la matrice des distances entre tt les individus
matrix=dist(dataSeats[,], method = "manhattan", diag = FALSE, upper = FALSE, p = 2)
## Warning in dist(dataSeats[, ], method = "manhattan", diag = FALSE, upper =
## FALSE, : NAs introduits lors de la conversion automatique
#clustering avec la methode compléte
h1=hclust(matrix, method = "complete", members = NULL)
plot(h1, main = "Dendrogramme avec methode d'agg complete")
#clustering avec la methode wadr D2
h2=hclust(matrix, method = "ward.D2", members = NULL)
plot(h1, main = "Dendrogramme avec methode ward D2")
we note that in the two segmentations CAH we found that the same element 14 is still an aberrated value this individuals is characterized by the largest number of seats dedicated to the economy class We will see the output of the segmentation by eliminating this element
#la matrice des distances entre tt les individus
dataSeats1=dataSeats[-c(14),]
#la matrice des distances entre tt les individus
matrix=dist(dataSeats1[,], method = "manhattan", diag = FALSE, upper = FALSE, p = 2)
## Warning in dist(dataSeats1[, ], method = "manhattan", diag = FALSE, upper =
## FALSE, : NAs introduits lors de la conversion automatique
#clustering avec la methode compléte
h1=hclust(matrix, method = "complete", members = NULL)
plot(h1, main = "Dendrogramme avec methode d'agg complete")
#clustering avec la methode wadr D2
h2=hclust(matrix, method = "ward.D2", members = NULL)
plot(h1, main = "Dendrogramme avec methode ward D2")
we had the same result by applying the method Ward D2 and the complete method
inertie <- sort(h2$height, decreasing = TRUE)
plot(inertie[1:10], type = "s", xlab = "Nombre de classes",
ylab = "Inertie")
points(c(2, 3, 4,6), inertie[c(2, 3,4,6)], col = c("green3",
"red3", "blue3","yellow3"), cex = 2, lwd = 3)
We can see that we can segment our data in 2, 3 or 4 clusters
groupes.cah <- cutree(h2,k=3 )
plot(groupes.cah)
barplot(table(groupes.cah))
dataSeatsClasseCAH3=data.frame(dataSeats1,groupes.cah)
classe1=dataSeatsClasseCAH3[which(dataSeatsClasseCAH3$groupes.cah==1),]
classe2=dataSeatsClasseCAH3[which(dataSeatsClasseCAH3$groupes.cah==2),]
classe3=dataSeatsClasseCAH3[which(dataSeatsClasseCAH3$groupes.cah==3),]
dataSeatsClasseCAH3
## dataFLotte.Aircraft dataFLotte.Seats..First.Class.
## 1 Airbus A319 12
## 2 Airbus A319 VIP 28
## 3 Airbus A320 12
## 4 Airbus A320 32-R 12
## 5 Airbus A330-200 0
## 6 Airbus A330-200 (3L2) 0
## 7 Airbus A330-200 (3L3) 0
## 8 Airbus A330-300 0
## 9 Boeing 717 12
## 10 Boeing 737-700 (73W) 12
## 11 Boeing 737-800 (738) 16
## 12 Boeing 737-800 (73H) 16
## 13 Boeing 737-900ER (739) 20
## 15 Boeing 757-200 (75A) 24
## 16 Boeing 757-200 (75E) 0
## 17 Boeing 757-200 (75M) 22
## 18 Boeing 757-200 (75N) 22
## 19 Boeing 757-200 (757) 24
## 20 Boeing 757-200 (75V) 22
## 21 Boeing 757-200 (75X) 0
## 22 Boeing 757-300 24
## 23 Boeing 767-300 (76G) 0
## 24 Boeing 767-300 (76L) 0
## 25 Boeing 767-300 (76P) 30
## 26 Boeing 767-300 (76Q) 30
## 27 Boeing 767-300 (76T) 0
## 28 Boeing 767-300 (76U) 36
## 29 Boeing 767-300 (76Z V.1) 0
## 30 Boeing 767-300 (76Z V.2) 0
## 31 Boeing 767-400 (76D) 0
## 32 Boeing 777-200ER 0
## 33 Boeing 777-200LR 0
## 34 CRJ 100/200 Pinnacle/SkyWest 0
## 35 CRJ 100/200 ExpressJet 0
## 36 CRJ 700 9
## 37 CRJ 900 12
## 38 E120 0
## 39 E170 9
## 40 E175 12
## 41 ERJ-145 0
## 42 MD-88 16
## 43 MD-90 16
## 44 MD-DC9-50 16
## dataFLotte.Seats..Business. dataFLotte.Seats..Eco.Comfort.
## 1 0 18
## 2 14 0
## 3 0 18
## 4 0 18
## 5 32 30
## 6 34 32
## 7 34 32
## 8 34 32
## 9 0 15
## 10 0 18
## 11 0 18
## 12 0 18
## 13 0 21
## 15 0 18
## 16 16 25
## 17 0 18
## 18 0 19
## 19 0 20
## 20 0 21
## 21 26 26
## 22 0 23
## 23 30 38
## 24 36 32
## 25 0 28
## 26 0 28
## 27 36 29
## 28 0 31
## 29 26 35
## 30 26 29
## 31 40 28
## 32 45 36
## 33 45 36
## 34 0 0
## 35 0 0
## 36 0 8
## 37 0 12
## 38 0 0
## 39 0 12
## 40 0 12
## 41 0 0
## 42 0 15
## 43 0 15
## 44 0 14
## dataFLotte.Seats..Economy. dataFLotte.Accommodation groupes.cah
## 1 96 126 1
## 2 0 54 2
## 3 120 150 1
## 4 120 150 1
## 5 181 243 3
## 6 168 243 3
## 7 227 293 3
## 8 232 298 3
## 9 83 110 1
## 10 94 124 1
## 11 126 124 1
## 12 126 160 1
## 13 139 180 1
## 15 132 174 1
## 16 130 171 1
## 17 141 181 1
## 18 141 182 1
## 19 136 180 1
## 20 132 175 1
## 21 132 184 1
## 22 177 224 1
## 23 140 208 1
## 24 143 211 1
## 25 203 261 3
## 26 203 261 3
## 27 143 208 1
## 28 143 210 1
## 29 171 226 3
## 30 171 226 3
## 31 178 246 3
## 32 188 269 3
## 33 188 269 3
## 34 50 50 2
## 35 50 50 2
## 36 48 65 2
## 37 52 76 2
## 38 28 28 2
## 39 48 69 2
## 40 52 76 2
## 41 50 50 2
## 42 118 149 1
## 43 129 160 1
## 44 90 120 1
number of seats in first class ;clustering with CAH 3 clusters
boxplot(classe1$dataFLotte.Seats..First.Class.,classe2$dataFLotte.Seats..First.Class.,classe3$dataFLotte.Seats..First.Class.)
number of seats in Business class ;clustering with CAH 3 clusters
boxplot(classe1$dataFLotte.Seats..Business.,classe2$dataFLotte.Seats..Business.,classe3$dataFLotte.Seats..Business.)
number of seats in Eco.Comfort class ;clustering with CAH 3 clusters
boxplot(classe1$dataFLotte.Seats..Eco.Comfort.,classe2$dataFLotte.Seats..Eco.Comfort.,classe3$dataFLotte.Seats..Eco.Comfort.)
boxplot(classe1$dataFLotte.Seats..Economy.,classe2$dataFLotte.Seats..Economy.,classe3$dataFLotte.Seats..Economy.)
groupes.cah <- cutree(h2,k=4 )
plot(groupes.cah)
barplot(table(groupes.cah))
dataSeatsClasseCAH4=data.frame(dataSeats1,groupes.cah)
classe1=dataSeatsClasseCAH4[which(dataSeatsClasseCAH4$groupes.cah==1),]
classe2=dataSeatsClasseCAH4[which(dataSeatsClasseCAH4$groupes.cah==2),]
classe3=dataSeatsClasseCAH4[which(dataSeatsClasseCAH4$groupes.cah==3),]
classe4=dataSeatsClasseCAH4[which(dataSeatsClasseCAH4$groupes.cah==4),]
dataSeatsClasseCAH4
## dataFLotte.Aircraft dataFLotte.Seats..First.Class.
## 1 Airbus A319 12
## 2 Airbus A319 VIP 28
## 3 Airbus A320 12
## 4 Airbus A320 32-R 12
## 5 Airbus A330-200 0
## 6 Airbus A330-200 (3L2) 0
## 7 Airbus A330-200 (3L3) 0
## 8 Airbus A330-300 0
## 9 Boeing 717 12
## 10 Boeing 737-700 (73W) 12
## 11 Boeing 737-800 (738) 16
## 12 Boeing 737-800 (73H) 16
## 13 Boeing 737-900ER (739) 20
## 15 Boeing 757-200 (75A) 24
## 16 Boeing 757-200 (75E) 0
## 17 Boeing 757-200 (75M) 22
## 18 Boeing 757-200 (75N) 22
## 19 Boeing 757-200 (757) 24
## 20 Boeing 757-200 (75V) 22
## 21 Boeing 757-200 (75X) 0
## 22 Boeing 757-300 24
## 23 Boeing 767-300 (76G) 0
## 24 Boeing 767-300 (76L) 0
## 25 Boeing 767-300 (76P) 30
## 26 Boeing 767-300 (76Q) 30
## 27 Boeing 767-300 (76T) 0
## 28 Boeing 767-300 (76U) 36
## 29 Boeing 767-300 (76Z V.1) 0
## 30 Boeing 767-300 (76Z V.2) 0
## 31 Boeing 767-400 (76D) 0
## 32 Boeing 777-200ER 0
## 33 Boeing 777-200LR 0
## 34 CRJ 100/200 Pinnacle/SkyWest 0
## 35 CRJ 100/200 ExpressJet 0
## 36 CRJ 700 9
## 37 CRJ 900 12
## 38 E120 0
## 39 E170 9
## 40 E175 12
## 41 ERJ-145 0
## 42 MD-88 16
## 43 MD-90 16
## 44 MD-DC9-50 16
## dataFLotte.Seats..Business. dataFLotte.Seats..Eco.Comfort.
## 1 0 18
## 2 14 0
## 3 0 18
## 4 0 18
## 5 32 30
## 6 34 32
## 7 34 32
## 8 34 32
## 9 0 15
## 10 0 18
## 11 0 18
## 12 0 18
## 13 0 21
## 15 0 18
## 16 16 25
## 17 0 18
## 18 0 19
## 19 0 20
## 20 0 21
## 21 26 26
## 22 0 23
## 23 30 38
## 24 36 32
## 25 0 28
## 26 0 28
## 27 36 29
## 28 0 31
## 29 26 35
## 30 26 29
## 31 40 28
## 32 45 36
## 33 45 36
## 34 0 0
## 35 0 0
## 36 0 8
## 37 0 12
## 38 0 0
## 39 0 12
## 40 0 12
## 41 0 0
## 42 0 15
## 43 0 15
## 44 0 14
## dataFLotte.Seats..Economy. dataFLotte.Accommodation groupes.cah
## 1 96 126 1
## 2 0 54 2
## 3 120 150 1
## 4 120 150 1
## 5 181 243 3
## 6 168 243 3
## 7 227 293 3
## 8 232 298 3
## 9 83 110 1
## 10 94 124 1
## 11 126 124 1
## 12 126 160 1
## 13 139 180 4
## 15 132 174 4
## 16 130 171 4
## 17 141 181 4
## 18 141 182 4
## 19 136 180 4
## 20 132 175 4
## 21 132 184 4
## 22 177 224 4
## 23 140 208 4
## 24 143 211 4
## 25 203 261 3
## 26 203 261 3
## 27 143 208 4
## 28 143 210 4
## 29 171 226 3
## 30 171 226 3
## 31 178 246 3
## 32 188 269 3
## 33 188 269 3
## 34 50 50 2
## 35 50 50 2
## 36 48 65 2
## 37 52 76 2
## 38 28 28 2
## 39 48 69 2
## 40 52 76 2
## 41 50 50 2
## 42 118 149 1
## 43 129 160 1
## 44 90 120 1
number of seats in first class ;clustering CAH with 4 class
boxplot(classe1$dataFLotte.Seats..First.Class.,classe2$dataFLotte.Seats..First.Class.,classe3$dataFLotte.Seats..First.Class.,classe4$dataFLotte.Seats..First.Class.)
number of seats in Business class;clustering CAH with 4 class
boxplot(classe1$dataFLotte.Seats..Business.,classe2$dataFLotte.Seats..Business.,classe3$dataFLotte.Seats..Business.,classe4$dataFLotte.Seats..Business.)
number of seats in Eco.Comfort class;clustering CAH with 4 class
boxplot(classe1$dataFLotte.Seats..Eco.Comfort.,classe2$dataFLotte.Seats..Eco.Comfort.,classe3$dataFLotte.Seats..Eco.Comfort.,classe4$dataFLotte.Seats..Eco.Comfort.)
number of seats in Economy class;clustering CAH with 4 class
boxplot(classe1$dataFLotte.Seats..Economy.,classe2$dataFLotte.Seats..Economy.,classe3$dataFLotte.Seats..Economy.,classe4$dataFLotte.Seats..Economy.)
now project the classes on the factorial plane using PCA to have more information of caracteristics of our clusters
res <- HCPC(result1,graph = FALSE)
plot(res, choice = "3D.map")
plot(res, choice = "tree")
plot(res, choice = "bar")
plot(res, choice = "map")
fviz_dend(res,
cex = 0.7, # Taille du text
palette = "jco", # Palette de couleur ?ggpubr::ggpar
rect = TRUE, rect_fill = TRUE, # Rectangle autour des groupes
rect_border = "jco", # Couleur du rectangle
labels_track_height = 0.8 )
fviz_cluster(res,
repel = TRUE, # Evite le chevauchement des textes
show.clust.cent = TRUE, # Montre le centre des clusters
palette = "jco", # Palette de couleurs, voir ?ggpubr::ggpar
ggtheme = theme_minimal(),
main = "Factor map"
)
result=kmeans(dataSeats[,2:6],3)
result
## K-means clustering with 3 clusters of sizes 10, 13, 21
##
## Cluster means:
## dataFLotte.Seats..First.Class. dataFLotte.Seats..Business.
## 1 8.200000 1.400000
## 2 6.461538 28.000000
## 3 14.190476 6.857143
## dataFLotte.Seats..Eco.Comfort. dataFLotte.Seats..Economy.
## 1 5.90000 46.1000
## 2 31.61538 197.9231
## 3 21.42857 127.1905
## dataFLotte.Accommodation
## 1 62.8000
## 2 264.2308
## 3 167.9524
##
## Clustering vector:
## [1] 3 1 3 3 2 2 2 2 1 3 3 3 3 2 3 3 3 3 3 3 3 2 3 3 2 2 3 3 2 2 2 2 2 1 1
## [36] 1 1 1 1 1 1 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 9591.40 39501.54 28307.14
## (between_SS / total_SS = 82.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
seatsclasse=data.frame(dataSeats,result$cluster)
View(seatsclasse)
classe1=seatsclasse[which(seatsclasse$result.cluster==1),]
classe2=seatsclasse[which(seatsclasse$result.cluster==2),]
classe3=seatsclasse[which(seatsclasse$result.cluster==3),]
number of seats in first class
boxplot(classe1$dataFLotte.Seats..First.Class.,classe2$dataFLotte.Seats..First.Class.,classe3$dataFLotte.Seats..First.Class.)
number of seats in business class
boxplot(classe1$dataFLotte.Seats..Business.,classe2$dataFLotte.Seats..Business.,classe3$dataFLotte.Seats..Business.)
number of seats in Eco Comfort
boxplot(classe1$dataFLotte.Seats..Eco.Comfort.,classe2$dataFLotte.Seats..Eco.Comfort.,classe3$dataFLotte.Seats..Eco.Comfort.)
number of seats in Economic class
boxplot(classe1$dataFLotte.Seats..Economy.,classe2$dataFLotte.Seats..Economy.,classe3$dataFLotte.Seats..Economy.)
number of total seats for each cluster
boxplot(classe1$dataFLotte.Accommodation,classe2$dataFLotte.Accommodation,classe3$dataFLotte.Accommodation)
this segmentation reminds us of the one made by PCA
seatsclasse[,7]=as.factor(seatsclasse[,7])
result2=PCA(seatsclasse,ind.sup =2,quali.sup =c(1,7))
plot.PCA(result2,choix="ind",habillage=7)
## PCA about secondary options :
library(factoextra)
library(FactoMineR)
#data:
dataFLotteopt = read.table(file = file.choose(), sep=";", header=TRUE , dec=".")
View(dataFLotteopt)
summary(dataFLotteopt)
## Aircraft Seat.Width..Club. Seat.Pitch..Club.
## Airbus A319 : 1 Min. : 0.0000 Min. : 0
## Airbus A319 VIP : 1 1st Qu.: 0.0000 1st Qu.: 0
## Airbus A320 : 1 Median : 0.0000 Median : 0
## Airbus A320 32-R : 1 Mean : 0.4409 Mean : 1
## Airbus A330-200 : 1 3rd Qu.: 0.0000 3rd Qu.: 0
## Airbus A330-200 (3L2): 1 Max. :19.4000 Max. :44
## (Other) :38
## Seat..Club. Seat.Width..First.Class. Seat.Pitch..First.Class.
## Min. : 0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.0000 Median :18.95 Median :36.00
## Mean : 0.2727 Mean :11.49 Mean :21.85
## 3rd Qu.: 0.0000 3rd Qu.:21.00 3rd Qu.:37.00
## Max. :12.0000 Max. :21.00 Max. :60.00
##
## Seats..First.Class. Seat.Width..Business. Seat.Pitch..Business.
## Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
## Median :12.00 Median : 0.000 Median : 0.00
## Mean :10.55 Mean : 7.545 Mean :25.55
## 3rd Qu.:17.00 3rd Qu.:21.000 3rd Qu.:60.00
## Max. :36.00 Max. :21.000 Max. :82.00
##
## Seats..Business. Seat.Width..Eco.Comfort. Seat.Pitch..Eco.Comfort.
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.:17.20 1st Qu.:34.00
## Median : 0.00 Median :17.30 Median :34.00
## Mean :11.86 Mean :15.65 Mean :30.58
## 3rd Qu.:27.00 3rd Qu.:18.00 3rd Qu.:35.00
## Max. :48.00 Max. :18.50 Max. :36.00
##
## Seats..Eco.Comfort. Seat.Width..Economy. Seat.Pitch..Economy.
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:15.00 1st Qu.:17.20 1st Qu.:30.50
## Median :19.50 Median :17.60 Median :31.00
## Mean :20.91 Mean :17.25 Mean :30.41
## 3rd Qu.:29.25 3rd Qu.:18.00 3rd Qu.:31.50
## Max. :42.00 Max. :18.50 Max. :32.00
##
## Seats..Economy. Accommodation Cruising.Speed..mph. Range..miles.
## Min. : 0.0 Min. : 28.0 Min. :364.0 Min. : 731
## 1st Qu.: 93.0 1st Qu.:123.0 1st Qu.:517.0 1st Qu.: 1944
## Median :132.0 Median :177.5 Median :517.0 Median : 3174
## Mean :129.7 Mean :172.5 Mean :515.8 Mean : 3869
## 3rd Qu.:171.0 3rd Qu.:226.0 3rd Qu.:517.0 3rd Qu.: 6221
## Max. :286.0 Max. :376.0 Max. :564.0 Max. :10375
##
## Engines Wingspan..ft. Tail.Height..ft. Length..ft.
## Min. :2.000 Min. : 64.92 Min. :20.42 Min. : 65.58
## 1st Qu.:2.000 1st Qu.:107.83 1st Qu.:32.23 1st Qu.:116.94
## Median :2.000 Median :129.79 Median :44.50 Median :155.25
## Mean :2.045 Mean :132.10 Mean :42.57 Mean :149.94
## 3rd Qu.:2.000 3rd Qu.:156.08 3rd Qu.:52.00 3rd Qu.:180.25
## Max. :4.000 Max. :213.00 Max. :62.54 Max. :231.83
##
## Wifi Video Power Satellite
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.5682 Mean :0.6136 Mean :0.5682 Mean :0.1364
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Flat.bed Sleeper Club First.Class
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :1.0000
## Mean :0.2045 Mean :0.1364 Mean :0.02273 Mean :0.5455
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
##
## Business Eco.Comfort Economy
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.0000
## Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.3864 Mean :0.8864 Mean :0.9773
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
dataopt=data.frame(dataFLotteopt$Aircraft,dataFLotteopt$Length..ft.,dataFLotteopt$Wifi,dataFLotteopt$Video,dataFLotteopt$Power,dataFLotteopt$Satellite,dataFLotteopt$Flat.bed)
View(dataopt)
result7=PCA(dataopt,quali.sup =1)
inertia distribution :
eig.val <- get_eigenvalue(result7)
fviz_eig(result7, addlabels = TRUE, ylim = c(0, 50))
#correlation of variables (opt) :
fviz_pca_var(result7, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Ãvite le chevauchement de texte,
)
#projection quality of options in dim 1 and 2:
fviz_cos2(result7, choice = "var", axes = 1:2)
#plot of airline:
fviz_pca_ind(result7,
geom.ind = "point", # Montre les points seulement (mais pas le "text")
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE, # Ellipses de concentration
legend.title = "Groups"
)
the following graph shows us: * satellite option is strongly positively correlated with 2nd axis * options: power, flatbed video and airplane length are positively correlated with 1st axis * wifi option is positively correlated with 2nd axis and negatively with the first This is 3 remarks help us to distinguish the following results: * The aircraft 10,12,21 and 26 (Boeing 737-700 (73W), Boeing 737-800 (73H), Boeing 757-200 (75X), Boeing 767-300 (76Q)), generally the boeing plane, contain this options among other aircraft For example, aircraft number 14 (Boeing 747-400 (74S)) contain most of the options (power, length, flatbed, video) * also planes 38 (E120) and 41 (ERJ-145) do not contain any options (they are distorted for economy class)
projection of individuals(airlines):
fviz_pca_biplot(result7, axes = c(1, 2),
label = "all", invisible = "none", labelsize = 4, pointsize = 2)
numbers of planes by options:
dataopt=as.matrix(dataopt)
barplot(dataopt[,3:7],names = c("wifi","video","power","satellite","flat-bed"),main="nombres des avions par options",xlab="les options",ylab="nombre des avions")
fleet segmentation with options of airline:
1/db-scan:Density-based spatial clustering of applications with noise:
dataopt=as.data.frame(dataopt)
str(dataopt)
## 'data.frame': 44 obs. of 7 variables:
## $ dataFLotteopt.Aircraft : Factor w/ 44 levels "Airbus A319",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ dataFLotteopt.Length..ft.: Factor w/ 24 levels " 65.583"," 87.830",..: 8 8 11 11 20 20 22 22 10 7 ...
## $ dataFLotteopt.Wifi : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 2 2 ...
## $ dataFLotteopt.Video : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 2 2 1 2 ...
## $ dataFLotteopt.Power : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 2 2 2 2 ...
## $ dataFLotteopt.Satellite : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ dataFLotteopt.Flat.bed : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
dataopt$dataFLotteopt.Length..ft.=as.numeric(dataopt$dataFLotteopt.Length..ft.)
dataopt$dataFLotteopt.Wifi=as.numeric(dataopt$dataFLotteopt.Wifi)
dataopt$dataFLotteopt.Video=as.numeric(dataopt$dataFLotteopt.Video)
dataopt$dataFLotteopt.Power=as.numeric(dataopt$dataFLotteopt.Power)
dataopt$dataFLotteopt.Satellite=as.numeric(dataopt$dataFLotteopt.Satellite)
dataopt$dataFLotteopt.Flat.bed=as.numeric(dataopt$dataFLotteopt.Flat.bed)
str(dataopt)
## 'data.frame': 44 obs. of 7 variables:
## $ dataFLotteopt.Aircraft : Factor w/ 44 levels "Airbus A319",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ dataFLotteopt.Length..ft.: num 8 8 11 11 20 20 22 22 10 7 ...
## $ dataFLotteopt.Wifi : num 2 2 2 2 1 1 1 1 2 2 ...
## $ dataFLotteopt.Video : num 1 2 1 1 2 2 2 2 1 2 ...
## $ dataFLotteopt.Power : num 1 1 1 1 2 2 2 2 2 2 ...
## $ dataFLotteopt.Satellite : num 1 1 1 1 1 1 1 1 1 2 ...
## $ dataFLotteopt.Flat.bed : num 1 1 1 1 2 1 2 1 1 1 ...
d=dataopt[,2:7]
library("dbscan")
## Warning: package 'dbscan' was built under R version 3.4.3
library("ANN2")
## Warning: package 'ANN2' was built under R version 3.4.3
library("ggplot2")
#run dbscan
db <- dbscan(d, eps = 1, minPts = 3)
db
## DBSCAN clustering for 44 objects.
## Parameters: eps = 1, minPts = 3
## The clustering contains 6 cluster(s) and 10 noise points.
##
## 0 1 2 3 4 5 6
## 10 3 3 14 7 4 3
##
## Available fields: cluster, eps, minPts
#Scatterplot Matrices entre les options
plot(d, col=db$cluster)
hullplot(d, db)
#6 clusters
dataOption=data.frame(d,db$cluster)
classe1=dataOption[which(dataOption$db.cluster==1),]
classe2=dataOption[which(dataOption$db.cluster==2),]
classe3=dataOption[which(dataOption$db.cluster==3),]
classe4=dataOption[which(dataOption$db.cluster==4),]
classe5=dataOption[which(dataOption$db.cluster==5),]
classe6=dataOption[which(dataOption$db.cluster==6),]
dataOption
## dataFLotteopt.Length..ft. dataFLotteopt.Wifi dataFLotteopt.Video
## 1 8 2 1
## 2 8 2 2
## 3 11 2 1
## 4 11 2 1
## 5 20 1 2
## 6 20 1 2
## 7 22 1 2
## 8 22 1 2
## 9 10 2 1
## 10 7 2 2
## 11 7 2 2
## 12 12 2 2
## 13 14 2 2
## 14 24 1 2
## 15 17 1 2
## 16 17 2 2
## 17 17 2 1
## 18 17 2 1
## 19 11 2 2
## 20 17 2 2
## 21 17 2 2
## 22 18 2 2
## 23 19 1 2
## 24 19 1 2
## 25 19 2 2
## 26 19 2 2
## 27 19 1 2
## 28 19 1 2
## 29 19 1 2
## 30 19 1 2
## 31 21 1 2
## 32 23 1 2
## 33 23 1 2
## 34 2 1 1
## 35 2 1 1
## 36 6 2 1
## 37 9 2 1
## 38 1 1 1
## 39 4 2 1
## 40 5 2 1
## 41 3 1 1
## 42 15 2 1
## 43 16 2 1
## 44 13 2 1
## dataFLotteopt.Power dataFLotteopt.Satellite dataFLotteopt.Flat.bed
## 1 1 1 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
## 5 2 1 2
## 6 2 1 1
## 7 2 1 2
## 8 2 1 1
## 9 2 1 1
## 10 2 2 1
## 11 2 2 1
## 12 2 2 1
## 13 2 1 1
## 14 2 1 2
## 15 1 1 1
## 16 2 1 1
## 17 1 1 1
## 18 1 1 1
## 19 1 1 1
## 20 1 1 1
## 21 2 2 1
## 22 1 1 1
## 23 2 1 1
## 24 2 1 2
## 25 2 2 1
## 26 2 2 1
## 27 2 1 2
## 28 2 1 1
## 29 2 1 2
## 30 2 1 1
## 31 2 1 2
## 32 2 1 2
## 33 2 1 2
## 34 1 1 1
## 35 1 1 1
## 36 1 1 1
## 37 1 1 1
## 38 1 1 1
## 39 1 1 1
## 40 1 1 1
## 41 1 1 1
## 42 2 1 1
## 43 2 1 1
## 44 1 1 1
## db.cluster
## 1 1
## 2 1
## 3 2
## 4 2
## 5 3
## 6 3
## 7 3
## 8 3
## 9 0
## 10 0
## 11 0
## 12 0
## 13 0
## 14 3
## 15 4
## 16 4
## 17 4
## 18 4
## 19 2
## 20 4
## 21 4
## 22 4
## 23 3
## 24 3
## 25 0
## 26 0
## 27 3
## 28 3
## 29 3
## 30 3
## 31 3
## 32 3
## 33 3
## 34 5
## 35 5
## 36 6
## 37 1
## 38 5
## 39 6
## 40 6
## 41 5
## 42 0
## 43 0
## 44 0
View(dataOption)
option length with 6 clusters of db-scan
boxplot(classe1$dataFLotteopt.Length..ft.,classe2$dataFLotteopt.Length..ft.,classe3$dataFLotteopt.Length..ft.,classe4$dataFLotteopt.Length..ft.,classe5$dataFLotteopt.Length..ft.,classe6$dataFLotteopt.Length..ft.,col="violet")
option wifi with 6 clusters of db-scan
boxplot(classe1$dataFLotteopt.Wifi,classe2$dataFLotteopt.Wifi,classe3$dataFLotteopt.Length..ft.,classe4$dataFLotteopt.Wifi,classe5$dataFLotteopt.Wifi,classe6$dataFLotteopt.Wifi,col="blue")
option video with 6 clusters of db-scan
boxplot(classe1$dataFLotteopt.Video,classe2$dataFLotteopt.Video,classe3$dataFLotteopt.Video,classe4$dataFLotteopt.Video,classe5$dataFLotteopt.Video,classe6$dataFLotteopt.Video,col="red")
option power with 6 clusters of db-scan
boxplot(classe1$dataFLotteopt.Power,classe2$dataFLotteopt.Power,classe3$dataFLotteopt.Power,classe4$dataFLotteopt.Power,classe5$dataFLotteopt.Power,classe6$dataFLotteopt.Power.Video,col="yellow")
## Warning in is.na(x): is.na() appliqué à un objet de type 'NULL' qui n'est
## ni une liste, ni un vecteur
## Warning in is.na(x): is.na() appliqué à un objet de type 'NULL' qui n'est
## ni une liste, ni un vecteur
## Warning in is.na(x): is.na() appliqué à un objet de type 'NULL' qui n'est
## ni une liste, ni un vecteur
option satellite with 6 clusters of db-scan
boxplot(classe1$dataFLotteopt.Satellite,classe1$dataFLotteopt.Satellite,classe1$dataFLotteopt.Satellite,classe1$dataFLotteopt.Satellite,classe1$dataFLotteopt.Satellite,classe1$dataFLotteopt.Satellite,col="red3")
noise points are those which represent the planes seen in acp of which they are correlated with the 2nd axis and contain the option satellite Example: airline 10 and 12 with type Boeing
option flatbed with 6 clusters of db-scan
boxplot(classe1$dataFLotteopt.Flat.bed,classe2$dataFLotteopt.Flat.bed,classe3$dataFLotteopt.Flat.bed,classe4$dataFLotteopt.Flat.bed,classe5$dataFLotteopt.Flat.bed,classe6$dataFLotteopt.Flat.bed,col="grey")
Conclusion:
*cluster2 (red): these planes are larger than clluster1 but they are medium-sized remains, they are characterized by the existence of the wifi option as aircraft number 22: Boeing 757-300
*cluster4 (sky blue):represents the moderately large size class (close to cluster3 size) also the most equipped aircraft by video option like aircraft number 8 Airbus A330-300
*cluster5 (pink): represents smaller planes that are not equipped with any option (are the planes designed for economy class) such as aircraft number 35: CRJ 100/200 ExpressJet , 38: E120 and 41: ERJ-145.
*the noise points: represent the well-equipped planes by the satellite option like the plane 10 and 12
to study other aspects of the world fleet we used another richer database
allplaines=read.table(file = file.choose(),header = TRUE,dec = ".", sep = ",")
head(allplaines)
## Airline AircraftWithSeatmap SeatPitch SeatWidth
## 1 Aegean Airlines Airbus A321 (321) 33.0 18
## 2 Aer Lingus Airbus A330-200 (332) V1 58.0 21
## 3 Aer Lingus Airbus A330-300 (333) 58.0 21
## 4 Aer Lingus Boeing 757-200 (752) 60.0 22
## 5 Aer Lingus Airbus A330-200 (332) V2 58.0 21
## 6 Aeroflot Airbus A330-300 (333) V1 60.2 26
## VideoType LaptopPower PowerType Wi.Fi SeatType Class
## 1 Overhead TV None None No Standard LHBusiness
## 2 On-Demand TV All Seats AC Power Yes Flat Bed LHBusiness
## 3 On-Demand TV All Seats AC Power Yes Flat Bed LHBusiness
## 4 On-Demand TV All Seats AC Power No Flat Bed LHBusiness
## 5 On-Demand TV All Seats AC Power Yes Flat Bed LHBusiness
## 6 On-Demand TV All Seats AC Power Yes Angle Lie Flat LHBusiness
We will opt in this part to do the CA to release the average profile of the world fleet and to make it we transformed our base into a table of contingencies
feqTab=read.table(file = file.choose(),header = TRUE,dec = ".",sep = ",")
head(feqTab)
## X airlines videoNonelist videoOnDemandlist
## 1 1 Aegean Airlines 5 0
## 2 2 Aer Lingus 2 8
## 3 3 Aeroflot 6 12
## 4 4 Aerolineas Argentinas 4 0
## 5 5 Aeromexico 5 14
## 6 6 Air Canada 5 41
## videoOverHeadlist videoPortablelist videoSatellitelist videoSeatbacklist
## 1 4 0 0 0
## 2 0 0 0 0
## 3 0 6 0 1
## 4 1 0 0 3
## 5 3 0 0 0
## 6 0 9 0 0
## powerAllSeatslist powersomeSeatslist powernonelist WifiNolist
## 1 0 0 9 9
## 2 7 0 3 4
## 3 16 0 9 14
## 4 0 0 8 6
## 5 3 0 19 14
## 6 33 13 9 48
## WifiYeslist SeatAngleLieFalatlist SeatClosedSuitelist SeatFlatBedlist
## 1 0 0 0 0
## 2 6 0 0 4
## 3 11 4 0 1
## 4 2 2 0 0
## 5 8 1 0 2
## 6 7 0 0 4
## SeatReclinerlist SeatOpenSuitelist SeatStandardlist ClassLHBusinesslist
## 1 0 0 9 1
## 2 0 0 6 4
## 3 8 0 12 5
## 4 2 0 4 2
## 5 3 0 16 3
## 6 15 5 31 10
## ClassLHEconomylist ClassLHFirstlist ClassPremEcolist ClassSHEconomylist
## 1 1 0 0 5
## 2 4 0 0 2
## 3 5 0 1 7
## 4 2 0 0 2
## 5 4 0 2 10
## 6 11 0 6 19
## ClassSHFirst_businesslist
## 1 2
## 2 0
## 3 7
## 4 2
## 5 3
## 6 9
feqTab=feqTab[,3:25]
dt <- as.table(as.matrix (feqTab))
ress=CA(feqTab,graph = TRUE)
plot.CA(ress,invisible = "col",axes =3:4)
plot.CA(ress,invisible = "col",axes =1:2)
plot.CA(ress,invisible = "col",axes =c(1,4))
plot.CA(ress,invisible = "col",axes =c(1,3))
plot.CA(ress,invisible = "col",axes =c(2,4))
plot.CA(ress,invisible = "col",axes =c(2,3))
plot.CA(ress,invisible = "row",axes =3:4)
plot.CA(ress,invisible = "row",axes =1:2)
plot.CA(ress,invisible = "row",axes =c(1,4))
plot.CA(ress,invisible = "row",axes =c(1,3))
plot.CA(ress,invisible = "row",axes =c(2,4))
plot.CA(ress,invisible = "row",axes =c(2,3))
ress$col$contrib
## Dim 1 Dim 2 Dim 3
## videoNonelist 19.32611585 7.868694e+00 3.745716e-02
## videoOnDemandlist 13.67766852 2.332672e-04 6.477920e-03
## videoOverHeadlist 1.34712561 1.203668e+01 6.184462e-01
## videoPortablelist 0.23205955 2.384432e-02 4.405033e+01
## videoSatellitelist 0.07164354 2.471761e+00 4.920229e+01
## videoSeatbacklist 0.03645092 8.981892e-01 1.201527e+00
## powerAllSeatslist 10.97952021 1.079499e+00 1.088006e-01
## powersomeSeatslist 0.88057202 1.754929e-01 3.385283e-04
## powernonelist 12.01893595 1.241006e+00 9.763545e-02
## WifiNolist 1.14869126 2.049700e+01 6.372128e-02
## WifiYeslist 2.54681901 4.590084e+01 1.470905e-01
## SeatAngleLieFalatlist 1.04055223 2.231312e+00 2.058463e-02
## SeatClosedSuitelist 2.29653649 6.337368e-03 2.092256e-01
## SeatFlatBedlist 4.09543271 6.361641e-01 2.538286e-01
## SeatReclinerlist 0.18326268 3.741532e-02 1.803515e+00
## SeatOpenSuitelist 1.03096763 9.951732e-02 1.055514e-04
## SeatStandardlist 3.68014207 4.117455e-01 1.622015e-01
## ClassLHBusinesslist 4.14343230 1.115904e+00 2.647042e-01
## ClassLHEconomylist 2.54404229 1.929653e-01 1.878562e-01
## ClassLHFirstlist 3.99179160 8.705174e-01 1.079273e-03
## ClassPremEcolist 0.85542478 4.116558e-04 1.141408e+00
## ClassSHEconomylist 13.27884317 2.176839e+00 1.342732e-02
## ClassSHFirst_businesslist 0.59396962 2.764007e-02 4.079429e-01
## Dim 4 Dim 5
## videoNonelist 1.123396e+00 17.5088192
## videoOnDemandlist 3.529154e-01 0.6477534
## videoOverHeadlist 2.800061e+00 44.7035374
## videoPortablelist 4.325153e+01 1.2288282
## videoSatellitelist 4.229722e+01 2.0356605
## videoSeatbacklist 6.025549e-01 0.3136155
## powerAllSeatslist 2.463480e-01 0.1983815
## powersomeSeatslist 2.605507e+00 2.7440790
## powernonelist 7.689015e-01 0.7009572
## WifiNolist 1.486931e+00 4.1987087
## WifiYeslist 3.314637e+00 9.3716958
## SeatAngleLieFalatlist 5.601429e-02 0.3868666
## SeatClosedSuitelist 4.560790e-01 2.6036452
## SeatFlatBedlist 3.866178e-04 1.1137736
## SeatReclinerlist 7.144350e-02 4.3863323
## SeatOpenSuitelist 2.146411e-01 0.1074232
## SeatStandardlist 1.016360e-02 0.1240526
## ClassLHBusinesslist 7.150024e-02 0.1788201
## ClassLHEconomylist 3.202444e-04 0.2655740
## ClassLHFirstlist 1.277713e-01 2.0785023
## ClassPremEcolist 2.578266e-02 0.3190988
## ClassSHEconomylist 7.949847e-02 0.1143678
## ClassSHFirst_businesslist 3.639562e-02 4.6695072
the characteristics of the average profile are all around the origin, by examining the coordinates of the variables, we found that the average global profile is summarized as follows:
Seat:recliner
Wfi:No_wifi
Power :some seats
video:Portable TV satellite TV et seatback tv
classe : Premium economy , first class , business class
#ACP
library(FactoMineR)
result_pca <- PCA(data[,2:33], graph = FALSE, ncp=3)
result_pca
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 44 individuals, described by 32 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
plot(result_pca)
library("factoextra")
eig.val <- get_eigenvalue(result_pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.349103e+01 4.215946e+01 42.15946
## Dim.2 7.164241e+00 2.238825e+01 64.54772
## Dim.3 4.917618e+00 1.536756e+01 79.91527
## Dim.4 1.499489e+00 4.685902e+00 84.60117
## Dim.5 1.330906e+00 4.159082e+00 88.76026
## Dim.6 1.053693e+00 3.292790e+00 92.05305
## Dim.7 7.891997e-01 2.466249e+00 94.51930
## Dim.8 5.070555e-01 1.584548e+00 96.10384
## Dim.9 3.499964e-01 1.093739e+00 97.19758
## Dim.10 2.859749e-01 8.936716e-01 98.09125
## Dim.11 1.700615e-01 5.314422e-01 98.62270
## Dim.12 1.341398e-01 4.191870e-01 99.04188
## Dim.13 9.131570e-02 2.853616e-01 99.32725
## Dim.14 5.817774e-02 1.818054e-01 99.50905
## Dim.15 4.008013e-02 1.252504e-01 99.63430
## Dim.16 3.157443e-02 9.867009e-02 99.73297
## Dim.17 2.133028e-02 6.665711e-02 99.79963
## Dim.18 1.960192e-02 6.125600e-02 99.86088
## Dim.19 1.406561e-02 4.395504e-02 99.90484
## Dim.20 1.113021e-02 3.478190e-02 99.93962
## Dim.21 7.443154e-03 2.325986e-02 99.96288
## Dim.22 6.277314e-03 1.961661e-02 99.98250
## Dim.23 2.689596e-03 8.404989e-03 99.99090
## Dim.24 2.044121e-03 6.387877e-03 99.99729
## Dim.25 4.953655e-04 1.548017e-03 99.99884
## Dim.26 2.394094e-04 7.481544e-04 99.99959
## Dim.27 1.188629e-04 3.714464e-04 99.99996
## Dim.28 1.338488e-05 4.182776e-05 100.00000
## Dim.29 3.621700e-30 1.131781e-29 100.00000
## Dim.30 1.836758e-30 5.739870e-30 100.00000
## Dim.31 1.004447e-31 3.138896e-31 100.00000
## Dim.32 1.004447e-31 3.138896e-31 100.00000
fviz_eig(result_pca, addlabels = TRUE, ylim = c(0, 50))
fviz_pca_var(result_pca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Ãvite le chevauchement de texte,
)
library("rgl")
## Warning: package 'rgl' was built under R version 3.4.3
plot3d(x=result_pca$ind$coord[,"Dim.1"],y=result_pca$ind$coord[,"Dim.2"],z=result_pca$ind$coord[,"Dim.3"],xlab = "dim1",choix = "ind",ylab = "dim2",zlab = "dim3")
#centrage et réduction
data_scale=scale(data[,c(2:33)])
d=dist(data_scale,method="manhattan")
cah <- hclust(d, method="ward.D2")
barplot(cah$height)
#visualisation du methode
plot(cah)
groupes.cah <- cutree(cah,k=4 )
plot(groupes.cah)
table(groupes.cah)
## groupes.cah
## 1 2 3 4
## 23 1 16 4
barplot(table(groupes.cah))
#segmentation n'est pas équitable
fviz_cos2(result_pca, choice = "var", axes = 1:2)
inertie <- sort(cah$height, decreasing = TRUE)
plot(inertie[1:10], type = "s", xlab = "Nombre de classes",
ylab = "Inertie")
points(c(2, 3, 4,6), inertie[c(2, 3,4,6)], col = c("green3",
"red3", "blue3","yellow3"), cex = 2, lwd = 3)
plot(cah, labels = FALSE, main = "Partition en 2, 3,4 ou 5 classes",
xlab = "", ylab = "", sub = "", axes = FALSE, hang = -1)
rect.hclust(cah, 2, border = "green3")
rect.hclust(cah, 3, border = "red3")
rect.hclust(cah, 4, border = "blue3")
rect.hclust(cah, 6, border = "yellow3")
library(devtools)
## Warning: package 'devtools' was built under R version 3.4.3
library(JLutils)
## Loading required package: plyr
best.cutree(cah)
## [1] 4
#the best.cutree function looks what would be the best partition between 2, 3,4 and 6 classes
#The partition into 4 classes will be best depending on the result.
best.cutree(cah, graph = TRUE, xlab = "Nombre de classes",
ylab = "Inertie relative")
## [1] 4
library(RColorBrewer)
A2Rplot(cah, k = 4, boxes = FALSE, col.up = "gray50",
col.down = brewer.pal(4, "Dark2"), show.labels = FALSE)
library(pvclust)
## Warning: package 'pvclust' was built under R version 3.4.3
fit <- pvclust(data_scale, method.hclust="ward.D2",
method.dist="euclidean")
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.59)... Done.
## Bootstrap (r = 0.68)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.89)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.09)... Done.
## Bootstrap (r = 1.18)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.39)... Done.
plot(fit) # dendogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)
#clustering en éliminant les points abbérants
data_scale_spab=scale(data[-2,c(2:33)])
d1=dist(data_scale_spab,method="manhattan")
cah2 <- hclust(d1, method="ward.D2")
plot(cah2)
groupes.cah <- cutree(cah2,k=3)
plot(groupes.cah)
table(groupes.cah)
## groupes.cah
## 1 2 3
## 24 15 4
barplot(table(groupes.cah))
#segmentation n'est pas équitable
The HCPC function (Hierarchical Classification on Principal Components) allows unsupervised classification of individuals. This function combines the main factors, heroic classification and partitioning to better visualize and emphasize the similarities between individuals. HCPC performs both the calculation of the distance matrix, the dendrogram and the partitioning of the population into classes.
plot(cah2)
res <- HCPC(result_pca,graph = FALSE)
plot(res, choice = "tree" )
#donner Graphique 3D combinant la classification hiérarchique et le plan des facteurs(visualisation)
# Principal components + tree
plot(res, choice = "3D.map")
plot(res, choice = "bar")
plot(res, choice = "map")
fviz_dend(res,
cex = 0.7, # Taille du text
palette = "jco", # Palette de couleur ?ggpubr::ggpar
rect = TRUE, rect_fill = TRUE, # Rectangle autour des groupes
rect_border = "jco", # Couleur du rectangle
labels_track_height = 0.8 )
#le dendograme suggère une solution de 4 groupes
fviz_cluster(res,
repel = TRUE, # Evite le chevauchement des textes
show.clust.cent = TRUE, # Montre le centre des clusters
palette = "jco", # Palette de couleurs, voir ?ggpubr::ggpar
ggtheme = theme_minimal(),
main = "Factor map"
)
K-Means clustering taking into account outliers with 4 clusters
fit <- kmeans(data_scale, 4)
# vary parameters for most readable graph
library(cluster)
clusplot(data_scale, fit$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
groupes.kmeans <- kmeans(data_scale,centers=3,nstart=4)
#affichage des résultats
print(groupes.kmeans)
## K-means clustering with 3 clusters of sizes 1, 15, 28
##
## Cluster means:
## Seat.Width..Club. Seat.Pitch..Club. Seat..Club. Seat.Width..First.Class.
## 1 6.4824939 6.4824939 6.4824939 0.7791285
## 2 -0.1507557 -0.1507557 -0.1507557 -1.1314261
## 3 -0.1507557 -0.1507557 -0.1507557 0.5782951
## Seat.Pitch..First.Class. Seats..First.Class. Seat.Width..Business.
## 1 0.9255149 1.6159379 1.3315294
## 2 -1.1144428 -0.9762958 1.3051387
## 3 0.5639688 0.4653035 -0.7467361
## Seat.Pitch..Business. Seats..Business. Seat.Width..Eco.Comfort.
## 1 0.9543069 0.1267015 -2.7534624
## 2 1.2971462 1.3049355 0.3890878
## 3 -0.7289822 -0.7035977 -0.1101019
## Seat.Pitch..Eco.Comfort. Seats..Eco.Comfort. Seat.Width..Economy.
## 1 -2.7580545 -1.8992508 -6.3970861
## 2 0.3986931 1.0195399 0.2295612
## 3 -0.1150837 -0.4783517 0.1054882
## Seat.Pitch..Economy. Seats..Economy. Accommodation Cruising.Speed..mph.
## 1 -6.4538472 -2.1723401 -1.4905983 0.04441488
## 2 0.1466345 0.8188510 0.9086150 0.45215440
## 3 0.1519403 -0.3610866 -0.4335224 -0.24381181
## Range..miles. Engines Wingspan..ft. Tail.Height..ft. Length..ft.
## 1 -0.3284213 -0.1507557 -0.4796928 -0.3250485 -0.9664471
## 2 1.1303122 0.2914610 1.0549468 0.9926835 1.0074765
## 3 -0.5937951 -0.1507557 -0.5480182 -0.5201858 -0.5052036
## Wifi Video Power Satellite Flat.bed Sleeper
## 1 0.8618163 0.7844233 -1.1339688 -0.3928183 -0.5012970 -0.3928183
## 2 -0.8678641 0.7844233 0.8618163 -0.2007738 0.9691742 0.5674042
## 3 0.4341480 -0.4482419 -0.4211884 0.1215866 -0.5012970 -0.2899373
## Club First.Class Business Eco.Comfort
## 1 6.4824939 0.9024378 1.2458487 -2.76092870
## 2 -0.1507557 -1.0829253 1.2458487 0.35396522
## 3 -0.1507557 0.5479086 -0.7119136 -0.09101963
##
## Clustering vector:
## [1] 3 1 3 3 2 2 2 2 3 3 3 3 3 2 3 2 3 3 3 3 2 3 2 2 3 3 2 3 2 2 2 2 2 3 3
## [36] 3 3 3 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 0.000 150.626 417.797
## (between_SS / total_SS = 58.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
One method to validate the number of clusters is the elbow method. The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10 in the examples above), and for each value of k calculate the sum of squared errors (SSE)
fviz_nbclust(data_scale, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
fviz_nbclust(data_scale, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
library(fpc)
## Warning: package 'fpc' was built under R version 3.4.3
##
## Attaching package: 'fpc'
## The following object is masked from 'package:dbscan':
##
## dbscan
sol.kmeans <- kmeansruns(data_scale,krange=2:10,criterion="ch")
plot(1:10,sol.kmeans$crit,type="b",xlab="Nb. de groupes",ylab="Silhouette")
#From k = 4 clusters, the addition of an additional group does not "significantly" increase the share of inertia explained by the partition.